Model Security Flash News List

List of Flash News about model security

Time	Details
2026-07-15 21:21	OpenAI Deploys AI Red Team for GPT-5.6 Defense. OpenAI deploys an AI red team to harden GPT-5.6 against prompt injection attacks, with the aim of strengthening its model security protocols.
2026-01-09 21:30	Anthropic Reports Classifiers Cut Claude Jailbreak Rate from 86% to 4.4% but Increase Costs and Benign Refusals; Two Attack Vectors Remain. According to @AnthropicAI, internal classifiers reduced the Claude jailbreak success rate from 86% to 4.4%. The classifiers were expensive to run, and the system became more likely to refuse benign requests after they were added. Despite the improvement, the system remained vulnerable to two types of attacks shown in the accompanying figure. Source: @AnthropicAI on X, Jan 9, 2026, https://twitter.com/AnthropicAI/status/2009739654833029304